A Chinese Corpus for Linguistic Research
Abstract
The project being reported on is a sub-project of the on-going research of the CKIP (Chinese Knowledge Information Processing) Group. This group was founded by Hsieh Ching-chun in 1986 and is currently directed by Keh-jiann Chen and Chu-Ren Huang (Chang et al. 1989, Hsieh et al. 1989, Chen et al. 1991). The CKIP research is divided into three sub-projects according to their goals: 1) an on-line lexicon for NLP, 2) a corpus, and 3) a parser. The sub-projects are designed to create a self-sufficient and mutually supporting environment for Chinese NLP: the corpus will be the database supporting the electronic lexicon, while the lexicon will be the basic reference for automatically tagging the corpus. Moreover, both the corpus and the lexicon will support the parser. Our parser adopts the unification-based formalism of ICG (Information-based Case Grammar, Chen and Huang 1990), which encodes all grammatical information on each lexical entry. At this point in time, the lexicon consists of a fully automated earlier version with limited grammatical information and an updated version with complete grammatical information for parsing. There are more than 40 thousand entries in the completed electronic dictionary, which is available on-line in Taiwan and allows basic pattern-matching searches. There is also a PC version with reduced search capacity available from the Industrial Technology Research Institute, the primary funding agency of this pilot dictionary project. The updated version now contains roughly 30 thousand entries with complete grammatical information and another 60 thousand with basic grammatical categories. Manipulation of lexical information, such as adding entries and specifying detailed grammatical information for each attribute, is handled on-line (Jian and Chen 1991). The completed 90-thousand-word lexicon will be our core lexicon for parsing.
The hierarchical arrangement will enable us to efficiently add new entries and create special lexicons for subdomains.
Similar resources
Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic Research
We explore the possibility of deeper linguistic research based on corpus and computational linguistic tools in this paper. In particular, we adopt Chinese Word Sketch, the application of Word Sketch Engine to Chinese GigaWord Corpus, for linguistic research. We apply Chinese Sketch Engine results to deeper linguistic account such as selectional restriction and event type selection. The study is...
The Standard of Chinese Corpus Metadata
The normalization of corpus metadata plays a key role in building sharable corpora. However, there is currently no uniform specification for defining and processing metadata in Chinese corpora. This paper introduces a metadata system we have proposed for Chinese corpora. In all, 46 elements are defined, which can be divided into 6 classes: information about copyright, information about background of ...
Translation and contrastive linguistic studies at the interface of English and Chinese: Significance and implications
Corpora have revolutionized nearly all areas of linguistic research over the past four decades (McEnery, Xiao and Tono 2006; McEnery and Hardie 2012). Translation studies and contrastive linguistics are no exceptions. Indeed, the rapid development of bilingual parallel corpora as well as monolingual and multilingual comparable corpora since the early 1990s has been of particular relevance and c...
Construction of a Chinese Opinion Treebank
In this paper, we build on the syntactically structured Chinese Treebank corpus to construct the Chinese Opinion Treebank for research on opinion analysis. We introduce the tagging scheme and develop a tagging tool for constructing this corpus. Annotated samples are described. Information including opinions (yes or no), their polarities (positive, neutral or negative), types (expression, status, or...
Salient Linguistic Features of Chinese Learners with Different L1s: A Corpus-based Study
The study aims to explore the salient linguistic features of Chinese lexical items from different L1s learners. The research method is corpus-based, including comparing the learner corpus and the native-speaker corpus, as well as sub-corpora for different L1s. The learner corpus which consists of more than 1.14 million Chinese words from novice proficiency to advanced learners’ texts is mainly ...
The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study
This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totall...